{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f6ec0108-5051-4c39-95ed-c52e183593a5",
   "metadata": {},
   "source": [
    "# Automated Dataset Overview\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/master/docs/tutorials/eda/eda-auto-dataset-overview.ipynb)\n",
    "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/master/docs/tutorials/eda/eda-auto-dataset-overview.ipynb)\n",
    "\n",
    "In this section we explore automated dataset overview functionality. This feature allows you to easily get\n",
    "a high-level understanding of datasets, including information about the number of rows and columns, the data types\n",
    "of each column, and basic statistical information such as min/max values, mean, quartiles, and standard deviation. This\n",
    "functionality can be a valuable tool for quickly identifying potential issues or areas of interest in your dataset\n",
    "before diving deeper into your analysis.\n",
    "\n",
    "Additionally, this feature also provides graphical representations of distances between features to highlight features\n",
    "that can be either simplified or completely removed. For each detected near-duplicate group, it plots interaction charts\n",
    "so it can be inspected visually.\n",
    "\n",
    "## Example\n",
    "\n",
    "We will start with getting the titanic dataset and performing a quick one-line overview to get the information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa00faab-252f-44c9-b8f7-57131aa8251c",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "!pip install autogluon.eda\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d1d55ef-620b-4aae-b3ba-03c052eaa011",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')\n",
    "df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')\n",
    "target_col = 'Survived'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4708b056-d8df-427c-a0c5-fdd28481a98d",
   "metadata": {},
   "source": [
    "To showcase near duplicates detection functionality, let's add a duplicated column: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c94de868-74b7-4005-8ffb-cf515ae35355",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train['Fare_duplicate'] = df_train['Fare']\n",
    "df_test['Fare_duplicate'] = df_test['Fare']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34c609f7-b2f5-4aee-8960-3997ca081289",
   "metadata": {},
   "source": [
    "The report consists of multiple parts: statistical information overview enriched with feature types detection and\n",
    "missing value counts.\n",
    "\n",
    "The last chart is a feature distance. It measures the similarity between features in a dataset. For example, if two\n",
    "variables are almost identical, their feature distance will be small. Understanding feature distance is useful in feature\n",
    "selection, where it can be used to identify which variables are redundant and should be considered for removal. To\n",
    "perform the analysis, we need just one line:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "64be5ef9-080f-40e8-8099-a44e9d8ef904",
   "metadata": {},
   "outputs": [],
   "source": [
    "import autogluon.eda.auto as auto\n",
    "\n",
    "auto.dataset_overview(train_data=df_train, test_data=df_test, label=target_col)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}